Text Recognition (TR) technology leverages a range of deep learning techniques to analyze and identify characters and words embedded in images. Its scope encompasses handwritten, printed, and scene text recognition. In this paper, we take a holistic approach, treating these categories as a unified challenge in order to study the complexities of TR comprehensively. State-of-the-art models predominantly rely on vision encoder--decoder (VED) transformer architectures. However, these models tend to be large, with parameter counts that incur significant memory consumption, and their autoregressive decoding leads to slow inference. Notably, these issues stem primarily from the decoder component. Consequently, our study introduces an efficient workflow that replaces the language modeling capabilities of the decoder with lightweight Mixer layers trained using Connectionist Temporal Classification. Following this approach, we present three decoder-free architectures that reduce the number of parameters by 74.3%, cut training memory requirements by 53.8%, and achieve an average inference speedup of 20x compared with their VED counterparts. Our workflow yields models that are on par with or better than the state of the art across six databases covering historical and modern handwritten, printed, and scene text recognition. We will make our code publicly available on GitHub.
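
To make the decoder-free idea concrete, the sketch below illustrates one way the described head could look: the vision encoder's feature sequence is refined by lightweight MLP-Mixer blocks and projected to per-frame character logits trained with a CTC loss. This is a minimal illustration under our own assumptions, not the paper's exact implementation; the class names, depth, and dimensions are hypothetical placeholders.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One MLP-Mixer block: token mixing across positions, then channel mixing."""
    def __init__(self, seq_len, dim, token_hidden=256, channel_hidden=1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(    # mixes information across sequence positions
            nn.Linear(seq_len, token_hidden), nn.GELU(), nn.Linear(token_hidden, seq_len))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(  # mixes information across feature channels
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

    def forward(self, x):                  # x: (batch, seq_len, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))
        return x

class CTCMixerHead(nn.Module):
    """Hypothetical decoder replacement: Mixer blocks + linear projection to CTC classes."""
    def __init__(self, seq_len, dim, num_classes, depth=2):
        super().__init__()
        self.blocks = nn.Sequential(*[MixerBlock(seq_len, dim) for _ in range(depth)])
        self.proj = nn.Linear(dim, num_classes)  # num_classes includes the CTC blank

    def forward(self, feats):              # feats: (batch, seq_len, dim) from the encoder
        return self.proj(self.blocks(feats))

# Toy usage: encoder features -> per-frame logits -> CTC loss (blank index 0).
batch, seq_len, dim, num_classes = 4, 64, 384, 97        # illustrative sizes only
feats = torch.randn(batch, seq_len, dim)                 # stand-in for encoder output
head = CTCMixerHead(seq_len, dim, num_classes)
log_probs = head(feats).log_softmax(-1).transpose(0, 1)  # (seq_len, batch, classes)
targets = torch.randint(1, num_classes, (batch, 20))     # dummy label sequences
input_lens = torch.full((batch,), seq_len, dtype=torch.long)
target_lens = torch.full((batch,), 20, dtype=torch.long)
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lens, target_lens)
```

Because the head emits all frame-level logits in one forward pass and CTC decoding is non-autoregressive, inference avoids the token-by-token generation loop of a VED decoder, which is the source of the speedup the abstract refers to.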